Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making UDS error handling and recovery more robust. #189

Merged
merged 18 commits into from
Oct 1, 2020

Conversation

dhermes
Copy link
Contributor

@dhermes dhermes commented Sep 25, 2020

Fixes #128.

@dhermes
Copy link
Contributor Author

dhermes commented Sep 25, 2020

Some notes:

  • I've been using https://github.com/dhermes/local-dd-agent to start and stop a local dd-agent
  • I fully expect the maintainers to want more quality control on the (fairly large) change set I dropped from thin air. Please feel free to pick it apart.
  • I hope the commits have broken it up in a way that's easy to digest.

});
}

function trySetNewSocket(client) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very open to a naming convention other try in trySetNewSocket()

}
}

function maybeAddUDSErrorHandler(client, lastSocketCreateTime) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very open to a naming convention other than maybe in maybeAddUDSErrorHandler()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get too hung up on names, unless I think it doesn't fit the style. Not too much of a naming style in here so it's fine. :) Spent a bunch of time making sure I understood the new paths in here, for this function and others. I didn't see any issues, looks good to me.

@@ -526,3 +502,59 @@ Client.prototype.childClient = function (options) {

exports = module.exports = Client;
exports.StatsD = Client;

function udsErrorHandler(client, lastSocketCreateTime) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm not sure if you want these handlers here at the end
  2. I'm not sure if you'd prefer client be this and we just put these helpers on the Client prototype

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine as is. Great to see the nice docs there.

@@ -354,7 +321,16 @@ Client.prototype.sendMessage = function (message, callback) {
return;
}

if (!this.socket) {
const socketWasMissing = !this.socket;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be considered a breaking change that I've added an attempt to recover here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks. Not sure yet if this will be a minor or major release, but I think this part by itself would fit in ok with what has been minor releases.

@@ -48,6 +51,8 @@ const Client = function (host, port, prefix, suffix, globalize, cacheDns, mock,
this.cacheDnsTtl = options.cacheDnsTtl || CACHE_DNS_TTL_DEFAULT;
this.host = options.host || process.env.DD_AGENT_HOST;
this.port = options.port || parseInt(process.env.DD_DOGSTATSD_PORT, 10) || 8125;
this.path = options.path;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I've added four public fields to Client (path, stream, udsGracefulRestartRateLimit and isChild), this may not be a desirable addition to the "exported" interface
  • I haven't updated the TypeScript definition(s) to reflect new exports / fields

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, and you shouldn't add these to TypeScript definitions, others ones aren't mentioned in there like this (I don't think)

@dhermes
Copy link
Contributor Author

dhermes commented Sep 25, 2020

To verify this change I've been running sent-to-uds.js

const process = require("process");

const hotShots = require("./index");

const DATADOG_ADDRESS = process.env.DATADOG_ADDRESS;

function toPath(datadogAddress) {
  if (datadogAddress.startsWith("unix://")) {
    // NOTE: The length of the "unix://" prefix is 7.
    return datadogAddress.slice(7, datadogAddress.length);
  }

  return datadogAddress;
}

function createUDSClient() {
  const opts = {
    mock: false,
    maxBufferSize: 0,
    path: toPath(DATADOG_ADDRESS),
    protocol: "uds",
    udsGracefulErrorHandling: true,
    udsGracefulRestartRateLimit: 1000,
  };
  return new hotShots.StatsD(opts);
}

function makeEventLoop(udsClient) {
  return function compute() {
    console.log(
      JSON.stringify({
        message: "Doing very big computation",
        level: "info",
        _timestamp: new Date(),
      }),
    );
    udsClient.increment("computation.uds", 1, 1, []);
  };
}

async function main() {
  const udsClient = createUDSClient();
  const compute = makeEventLoop(udsClient);
  setInterval(compute, 500);
}

if (require.main === module) {
  main();
}

via

SOCKET_DIR="${HOME}/var-run-datadog"
DATADOG_ADDRESS="unix://${SOCKET_DIR}/dsd.socket" node ./sent-to-uds.js

and then turning on and off the local-dd-agent (https://github.com/dhermes/local-dd-agent) via

make run DATADOG_ADDRESS="${SOCKET_DIR}/dsd.socket"

@bdeitte
Copy link
Collaborator

bdeitte commented Sep 26, 2020

@dhermes Thanks for the PR and all of the info! Neat to see local-dd-agent too. I see the tests are still failing in here, looks like something isn't going right on the Mac version. Also as noted in #188 looks like we have some overlap here, haven't had a chance yet myself to look over everything and understand.

@dhermes
Copy link
Contributor Author

dhermes commented Sep 26, 2020

Yup; I haven't had a chance to parse the test failures but will attempt to tackle it next week. I didn't want to devote too much time building a cathedral before getting sign off from maintainers.

Once / if you think this looks promising I can start to tackle the test failures? I'd love an assist if the Travis CI failures look "easy" to tackle for you (but by all means not needed).

Also, since there are 4 distinct issues, I can send 4 different PRs if you'd rather (the commits themselves in this PR can be broken up nicely to fix each of the 4 issues separately).

if (process.platform === 'linux') {
return [
os.constants.errno.ENOTCONN,
os.constants.errno.ECONNREFUSED,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I didn't realize there was a constant here available on Node 8 and above. Good to know.

@bdeitte
Copy link
Collaborator

bdeitte commented Sep 27, 2020

I took a pretty quick pass on this, but great to see all the work done and makes general sense to me. When the tests are passing (and I have a little more time) I'll take another look at this and see if I have any more feedback. This UDS handling has been looking ugly for awhile (from what I can gather from the bug reports), glad to see someone take it all on and make it work better.

I may be able to help with the tests here at some point (do run a Mac here) if you need it, just been short on time recently. It's not clear to me right away what's going wrong in there- oddly enough it's a mix of all sorts of different tests failing in there (https://travis-ci.org/github/brightcove/hot-shots/jobs/730370815#L1518).

Danny Hermes added 17 commits September 28, 2020 09:40
…` in constructor.

Also using these in places where `options` was used.
This way we can retry using this helper.
This is in advance of a change that will allow these to be OS specific.
See https://nodejs.org/docs/latest-v12.x/api/errors.html
for an example (in `error.errno`) explaining why error codes are
negative.
This way `this.socket` is only replaced if `createTransport()`
returned an actual socket (vs. `null`).
See: https://travis-ci.org/github/brightcove/hot-shots/jobs/730354851

This was due to `eslint` failures. Also changed some `==` to `===`
and added / removed some `;` in lines that needed / didn't need.
This "bug" was introduced to my bugfix in a re-factor.
@dhermes
Copy link
Contributor Author

dhermes commented Sep 30, 2020

@bdeitte Tests are green here so should be good to review.

@bdeitte
Copy link
Collaborator

bdeitte commented Oct 1, 2020

Awesome. I spent some time with tonight and don't see any issues. Thanks again for this! Going to release in a minor release here in a bit.

@bdeitte bdeitte merged commit 907dd57 into brightcove:master Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle issues with udsGracefulErrorHandling
2 participants